14.2 The Naïve Bayes Classifier
Introduction to Conditional Probabilities
In this example, we will be using discrete probabilities. First, let’s start with simple probabilities, and then we will develop Bayes’ theorem and the Naïve Bayes classifier.
Assume the throw of a die. Since there are six sides to a die, the probability of rolling any number between 1 and 6 is the same: 1/6. And since the probability is the same for each side, the probability distribution is an even distribution, also referred to as a uniform distribution. For example:
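\[
p(1) = p(2) = p(3) = p(4) = p(5) = p(6) = \tfrac{1}{6}
\]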
Sometimes it is necessary to measure the probability of an outcome based on two independent events. As an example, let’s assume we have a white die and a black die. We want to know the probability of rolling a 1 on the black die and a 1 on the white die. This probability is:
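\[
p(w{=}1,\ b{=}1) = p(w{=}1) \times p(b{=}1) = \tfrac{1}{6} \times \tfrac{1}{6} = \tfrac{1}{36}
\]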
Notice that the joint probability p(w=1, b=1) is simply the product of the two individual probabilities. (Note: This result is consistent with the density function of rolling two dice as shown in Figure 10.11.) A joint probability is the same as the “And” condition in Boolean algebra. In other words, it is the probability of A occurring AND B occurring. Stated in general terms:
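\[
p(A, B) = p(A) \times p(B) \quad \text{(for independent events } A \text{ and } B\text{)}
\]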
Also note that joint probabilities are commutative.
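\[
p(A, B) = p(B, A)
\]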
In many situations, however, the two events are not completely independent of each other. If we have a deck of cards and we want to know the probability of drawing an ace, the probability is 4/52, since there are 4 aces in a deck of cards. Another question we might ask is, what is the probability of drawing two aces? In this case, the probability of drawing the second ace is not 4/52, because there are now only 3 aces and only 51 cards left in the deck. So, this probability would be
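\[
p(\text{ace},\ \text{ace}) = \tfrac{4}{52} \times \tfrac{3}{51} = \tfrac{12}{2652} \approx 0.0045
\]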
This can be read as the probability of an ace AND the probability of an ace GIVEN that an ace has already been drawn. In generalized notation we write:
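\[
p(A, B) = p(A) \times p(B \mid A)
\]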
This is read, “probability of A and B given A is equal to the probability of A times probability of B given A is known or has occurred.” This is called conditional probability, or better yet, a joint conditional probability, because the probability is measured as a condition of a previous event or condition.
From this equation and from the commutative property, we can develop the Bayes equation. (Note: Part of the proof of this equation, which we so simply present, is that conditional probabilities are indeed commutative. In our presentation, we take that statement as a given.)
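Since p(A, B) = p(B, A), we have p(A) × p(B | A) = p(B) × p(A | B); dividing both sides by p(B) gives the Bayes equation:
\[
p(A \mid B) = \frac{p(B \mid A)\, p(A)}{p(B)}
\]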
One other important point of notation is that either A or B can have multiple values or features. For example, if B = x1, x2, x3, …, then the equation becomes
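\[
p(A \mid x_1, x_2, x_3, \ldots) = \frac{p(x_1, x_2, x_3, \ldots \mid A)\, p(A)}{p(x_1, x_2, x_3, \ldots)}
\]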
Making a simplification by assuming that the x’s are always independent allows us to use the equation for classification, but then we identify it as the Naïve Bayes classifier. It is naïve because of this assumption that the x’s are independent. Amazingly, even though the assumption does not always hold true, the Naïve Bayes classifier is often very accurate.
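Under the independence assumption, the joint likelihoods factor into products of individual likelihoods, giving the form used for classification:
\[
p(A \mid x_1, \ldots, x_n) = \frac{p(A)\, p(x_1 \mid A)\, p(x_2 \mid A) \cdots p(x_n \mid A)}{p(x_1)\, p(x_2) \cdots p(x_n)}
\]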
Naïve Bayes Classifier
Thomas Bayes, an 18th-century British mathematician, created this formula for determining the relationship between the probabilities of two events and their conditional probabilities. Again, a conditional probability is the likelihood of an outcome occurring based on a previous outcome occurring. Bayes’ theorem is a way of explaining the probability of an outcome occurring based on prior knowledge that may be related to the outcome.
The Naïve Bayes classifier tweaks Bayes’ theorem by assuming that the presence or absence of a particular feature of a class is unrelated to the presence or absence of other features, or in other words, that all of the features contribute independently to the probability that the object is part of a certain class. If the features of an object are shape, color, and weight, and the object is spherical, black and white, and about 430 grams, it may be a soccer ball, but a spherical shape does not influence its color.
The features (input variables) for the Naïve Bayes classifier are categories of data (categorical variables), or they can be converted from continuous values to categorical values by creating a histogram of the continuous values. In the ball example, weight could be made discrete by defining levels, such as less than 300 grams = “small,” between 300 and 400 grams = “medium,” and 400 to 445 grams = “large.”
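As a small illustration of that binning, here is a minimal sketch (the helper name and the exact boundary handling are our own assumptions, not from the text):

```python
def weight_category(grams):
    """Convert a continuous weight in grams to a categorical level
    (bins follow the ball example: <300 small, 300-400 medium, 400-445 large)."""
    if grams < 300:
        return "small"
    elif grams <= 400:
        return "medium"
    else:
        return "large"

print(weight_category(430))  # -> "large", the soccer-ball weight in the example
```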
The Naïve Bayes classifier has many strengths. It can handle missing values, is robust to irrelevant variables, is computationally efficient, and can handle high-dimensional data (with many variables). It is the most common method for text classification such as spam filtering. As indicated above, it is used for categorical data.
The steps in the process for doing the Naïve Bayes method are very straightforward, just counting, dividing, and multiplying; a short code sketch of these steps follows the list below.
- Estimate the individual conditional probabilities, or likelihoods, for each feature. Each conditional likelihood is estimated as the proportion of a feature value within a class; each overall likelihood is its proportion within the entire dataset.
- For each class, multiply these feature likelihoods by each other, then by the class probability, and divide by the product of the overall feature likelihoods. In other words, apply the Naïve Bayes equation for each class.
- Assign each new record to the correct class, which is the class with the highest probability score for this set of feature values. Even though the output of a Naïve Bayes classifier is a probability score, it is not the true probability; it is only proportional to the true probability.
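Here is a minimal sketch of those counting-and-multiplying steps in Python. The function names and data layout are our own illustration, not code from the text, and the likelihoods are estimated as simple proportions, as described in the steps above.

```python
from collections import Counter, defaultdict

def train_naive_bayes(emails):
    """emails is a list of (words, label) pairs, e.g. (["Happy", "Congrat"], "Normal").
    Returns class priors, per-class word likelihoods, and overall word likelihoods,
    all estimated by simple counting."""
    class_counts = Counter(label for _, label in emails)
    word_counts = defaultdict(Counter)      # word_counts[label][word]
    total_words = Counter()                 # word counts across all classes
    for words, label in emails:
        for w in words:
            word_counts[label][w] += 1
            total_words[w] += 1
    priors = {c: n / len(emails) for c, n in class_counts.items()}
    n_all = sum(total_words.values())
    overall = {w: n / n_all for w, n in total_words.items()}
    conditional = {c: {w: n / sum(wc.values()) for w, n in wc.items()}
                   for c, wc in word_counts.items()}
    return priors, conditional, overall

def score(words, label, priors, conditional, overall):
    """Apply the Naive Bayes equation for one class: the class prior times each
    word's conditional likelihood, divided by each word's overall likelihood."""
    p = priors[label]
    for w in words:
        p *= conditional[label].get(w, 0.0) / overall.get(w, 1.0)
    return p

# The email is assigned to whichever class produces the higher score.
```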
Amazingly, even if some of the events are not truly independent, the Naïve Bayes classifier usually works really well.
An Example of Filtering Spam Emails
As an example, let’s assume an email arrives and we want to assign it to a class. The two classes of interest are Normal email and Spam email. To utilize the Naïve Bayes formula, we use previous emails as training data to calculate the likelihoods for the equation. The features that we are testing against are a set of words that we have determined are good indicators of Normal versus Spam. We test this new email against the learned algorithm.
Figure 14.1 shows the training data. There are 50 emails that are used to train the algorithm and establish the likelihood of any word occurring. There are 5 words that are tested, with a total of 70 instances of those words in the training emails: 50 instances in normal emails and 20 instances in spam emails. Figure 14.1 also shows the distribution of word instances, and in the rightmost column it shows the likelihood of each word appearing. In this instance, we use the word “likelihood” to mean the probabilities calculated from the training data.
In our evaluation, there is an input email that has two words in it, namely “Happy” and “Congratulations”. We want to test whether this email should be classified as Normal or Spam. To do this, we will test it as normal and derive a probability, then we will test it as spam and derive a probability. Whichever class has the higher probability is the correct class. Here are the two equations, where N = Normal, S = Spam, and Congratulations is abbreviated to “Congrat”.
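\[
P(N \mid \text{Happy}, \text{Congrat}) = \frac{P(N)\, P(\text{Happy} \mid N)\, P(\text{Congrat} \mid N)}{P(\text{Happy})\, P(\text{Congrat})}
\]
\[
P(S \mid \text{Happy}, \text{Congrat}) = \frac{P(S)\, P(\text{Happy} \mid S)\, P(\text{Congrat} \mid S)}{P(\text{Happy})\, P(\text{Congrat})}
\]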
We read the top equation as follows: The probability of a normal email given the words “Happy” and “Congrat” is equal to the prior probability of receiving a normal email, times the likelihood of the word “Happy” in a normal email, times the likelihood of the word “Congrat” in a normal email, divided by the likelihood of the word “Happy” times the likelihood of the word “Congrat”.
Next, let’s calculate the likelihoods of the words for a normal email. Figure 14.2 shows those likelihoods.
In Figure 14.2 the overall word likelihoods are also included. Using that data, we can write the Bayesian equation to calculate the probability that the new email is a normal email.
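Substituting the class prior and the word likelihoods from Figure 14.2 into the Normal equation above yields a probability score of about 0.65.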
Next, we do the same calculation for these two words in a spam email. Figure 14.3 shows the likelihoods of the words in spam emails.
And the corresponding equation is:
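Substituting the corresponding values from Figure 14.3 into the Spam equation yields a probability score of about 0.27.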
Next, we compare the two probabilities. An email that includes the two words “Happy” and “Congratulations” has a 65% probability of being normal and a 27% probability of being spam. Thus, we classify this email as Normal.
As an exercise, take a few minutes and calculate the probability scores for Normal and Spam for a new email that contains the two words “Promotion” and “Congratulations”. You should get P(Normal) = .173 and P(Spam) = .867. This email would be classified as Spam.
This technique can be applied to emails that contain one test word or many test words; in our example we used two. The ability to test multiple words depends on the training data. In our example, only five words were used, but in real situations the algorithm may use 10, 15, or even 100 key words. We also had only 50 training emails, whereas in a real situation thousands of documents may be used to estimate the word-frequency likelihoods. A more detailed example is provided in the section on Text Classification.
Additional Considerations Using the Naïve Bayes Algorithm
In computing the probabilities, there are two important problems to handle: numerical underflow and rare events. Because the calculation multiplies several probability values that are close to zero, the resulting values are very small, and we run into the problem of numerical underflow, where the computer runs out of decimal places in memory. To alleviate this, we often sum the logarithms of the probabilities instead of multiplying the probabilities themselves, so the output is a log probability; the class is still assigned based on the highest value.
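A minimal sketch of that log-probability approach, reusing the hypothetical priors, conditional, and overall tables from the earlier sketch:

```python
import math

def log_score(words, label, priors, conditional, overall):
    """Sum logarithms instead of multiplying raw probabilities,
    which avoids numerical underflow when many words are involved."""
    logp = math.log(priors[label])
    for w in words:
        logp += math.log(conditional[label][w])   # assumes smoothed, nonzero likelihoods
        logp -= math.log(overall[w])
    return logp  # the class with the highest log score is still the one assigned
```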
The other problem is that if one of the feature values never appears with one of the class labels in the training set, the conditional probability for that feature will be zero, which makes the outcome class probability zero because we multiply all of the conditional probabilities together. To alleviate this, we use a smoothing technique such as Laplace smoothing, where we add one (or a very small number like 0.01) to every count in the model so that there are no zero or empty conditional probabilities.
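One common way to write the add-one (Laplace) smoothed conditional likelihood (a standard form; the text itself only states the idea of adding a small number to every count) is
\[
P(w \mid c) = \frac{\text{count}(w, c) + 1}{\text{count}(c) + V},
\]
where count(w, c) is the number of times feature value w appears with class c, count(c) is the total number of feature instances in class c, and V is the number of distinct feature values.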
Adding this small value has little effect on the final classification because we rank the probability scores and assign the class with the highest score. We don’t report the actual probability number or use it in any other calculation.